Introduction to pandas: pandas is a powerful library for data analysis and manipulation
➢ Fast, flexible, expressive data structures: DataFrame and Series
➢ Supports SQL-like create, read, update, and delete operations
➢ Ships with a rich set of data-processing functions
➢ Supports time-series analysis
➢ Handles missing data flexibly
➢ The sep parameter specifies the field separator; if it is wrong, every row is read as one undivided string.
➢ The header parameter specifies which row holds the column names; with header=None, default integer column names are assigned.
➢ encoding gives the file's character encoding; common encodings include utf-8, utf-16, gbk, gb2312, and gb18030. With the wrong encoding the data cannot be read and the IPython interpreter raises a parsing error.
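These three parameters can be illustrated without a real file by parsing an in-memory string (a minimal sketch; the CSV text here is made up, not the meal-order data):

```python
import io
import pandas as pd

# Synthetic semicolon-separated text standing in for a file.
csv_text = "a;b;c\n1;2;3\n4;5;6"

# Correct separator: three columns are recovered.
ok = pd.read_csv(io.StringIO(csv_text), sep=';')

# Wrong separator: each row collapses into a single column.
bad = pd.read_csv(io.StringIO(csv_text), sep=',')

# header=None treats the first line as data and assigns
# default integer column names 0, 1, 2.
no_header = pd.read_csv(io.StringIO(csv_text), sep=';', header=None)
```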
import pandas as pd
# 'meal_order_info.csv' is the file path
# encoding='GBK' specifies the encoding
pd.read_csv('meal_order_info.csv',encoding='GBK').head()
| info_id | emp_id | number_consumers | mode | dining_table_id | dining_table_name | expenditure | dishes_count | accounts_payable | use_start_time | ... | lock_time | cashier_id | pc_id | order_number | org_id | print_doc_bill_num | lock_table_info | order_status | phone | name | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 417 | 1442 | 4 | NaN | 1501 | 1022 | 165 | 5 | 165 | 2016/8/1 11:05:36 | ... | 2016/8/1 11:11:46 | NaN | NaN | NaN | 330 | NaN | NaN | 1 | 18688880641 | 苗宇怡 |
| 1 | 301 | 1095 | 3 | NaN | 1430 | 1031 | 321 | 6 | 321 | 2016/8/1 11:15:57 | ... | 2016/8/1 11:31:55 | NaN | NaN | NaN | 328 | NaN | NaN | 1 | 18688880174 | 赵颖 |
| 2 | 413 | 1147 | 6 | NaN | 1488 | 1009 | 854 | 15 | 854 | 2016/8/1 12:42:52 | ... | 2016/8/1 12:54:37 | NaN | NaN | NaN | 330 | NaN | NaN | 1 | 18688880276 | 徐毅凡 |
| 3 | 415 | 1166 | 4 | NaN | 1502 | 1023 | 466 | 10 | 466 | 2016/8/1 12:51:38 | ... | 2016/8/1 13:08:20 | NaN | NaN | NaN | 330 | NaN | NaN | 1 | 18688880231 | 张大鹏 |
| 4 | 392 | 1094 | 10 | NaN | 1499 | 1020 | 704 | 24 | 704 | 2016/8/1 12:58:44 | ... | 2016/8/1 13:07:16 | NaN | NaN | NaN | 330 | NaN | NaN | 1 | 18688880173 | 孙熙凯 |
5 rows × 21 columns
data=pd.read_csv('meal_order_info.csv',encoding="gbk")
# data
# index=None drops the row index when writing
data.to_csv('meal_order_info_save_by_to_csv.csv',
index=None)
pd.read_excel('meal_order_detail.xlsx').head()
| detail_id | order_id | dishes_id | logicprn_name | parent_class_name | dishes_name | itemis_add | counts | amounts | cost | place_order_time | discount_amt | discount_reason | kick_back | add_inprice | add_info | bar_code | picture_file | emp_id | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2956 | 417 | 610062 | NaN | NaN | 蒜蓉生蚝 | 0 | 1 | 49 | NaN | 2016-08-01 11:05:36 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/104001.jpg | 1442 |
| 1 | 2958 | 417 | 609957 | NaN | NaN | 蒙古烤羊腿_x000D_\n_x000D_\n_x000D_\n | 0 | 1 | 48 | NaN | 2016-08-01 11:07:07 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/202003.jpg | 1442 |
| 2 | 2961 | 417 | 609950 | NaN | NaN | 大蒜苋菜 | 0 | 1 | 30 | NaN | 2016-08-01 11:07:40 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/303001.jpg | 1442 |
| 3 | 2966 | 417 | 610038 | NaN | NaN | 芝麻烤紫菜 | 0 | 1 | 25 | NaN | 2016-08-01 11:11:11 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/105002.jpg | 1442 |
| 4 | 2968 | 417 | 610003 | NaN | NaN | 蒜香包 | 0 | 1 | 13 | NaN | 2016-08-01 11:11:30 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/503002.jpg | 1442 |
data=pd.read_excel('meal_order_detail.xlsx').head(10)
# data
data.to_excel('meal_order_detail_save_by_to_excel.xlsx',
index=None)
A Series is a one-dimensional sequence, similar to a Python list but with an attached index
pd.Series(['Jason','Jackson','Black','Jack'])
0      Jason
1    Jackson
2      Black
3       Jack
dtype: object
pd.Series([['Jason','Jackson','Jack','Black'],[1,2,3,4]])
0    [Jason, Jackson, Jack, Black]
1                     [1, 2, 3, 4]
dtype: object
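A Series can also carry an explicit index instead of the default 0..n-1 — a small sketch with made-up names and scores:

```python
import pandas as pd

# Custom string index instead of the default RangeIndex.
s = pd.Series([90, 60, 50], index=['Jason', 'Jackson', 'Black'])

# Label-based and position-based access coexist.
print(s['Jason'])   # label access
print(s.iloc[0])    # positional access

# Building from a dict uses the keys as the index.
s2 = pd.Series({'Jack': 77, 'Lucy': 88})
```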
A DataFrame is a two-dimensional, tabular data structure
import numpy as np
data=np.random.randint(1,9,16).reshape(4,4)
# data
index=['AI','BI','CI','DI']
columns=['AC','BC','CC','DC']
# index sets the row labels
# columns sets the column labels
pd.DataFrame(data,index=index,columns=columns)
| AC | BC | CC | DC | |
|---|---|---|---|---|
| AI | 7 | 6 | 5 | 1 |
| BI | 8 | 8 | 4 | 1 |
| CI | 4 | 5 | 8 | 7 |
| DI | 2 | 4 | 8 | 8 |
# pd.DataFrame?
data=np.random.randint(1,9,16).reshape(4,4)
# data
index=['AI','BI','CI','DI']
data=pd.DataFrame(data,index=index,columns=columns)
data
| AC | BC | CC | DC | |
|---|---|---|---|---|
| AI | 2 | 1 | 1 | 5 |
| BI | 1 | 8 | 4 | 4 |
| CI | 6 | 6 | 6 | 7 |
| DI | 4 | 3 | 6 | 3 |
data.values
array([[2, 1, 1, 5],
[1, 8, 4, 4],
[6, 6, 6, 7],
[4, 3, 6, 3]])
data.index
Index(['AI', 'BI', 'CI', 'DI'], dtype='object')
data.columns
Index(['AC', 'BC', 'CC', 'DC'], dtype='object')
data.dtypes
AC    int32
BC    int32
CC    int32
DC    int32
dtype: object
data.size
16
data.ndim
2
data.shape
(4, 4)
CRUD (create, read, update, delete)
Common operations
head, tail, iloc (integer-location based indexing), loc (label-based indexing)
data=np.random.randint(1,9,16).reshape(4,4)
# data
index=['AI','BI','CI','DI']
data=pd.DataFrame(data,index=index,columns=columns)
data
| AC | BC | CC | DC | |
|---|---|---|---|---|
| AI | 3 | 7 | 8 | 7 |
| BI | 2 | 2 | 8 | 3 |
| CI | 2 | 8 | 2 | 5 |
| DI | 1 | 7 | 1 | 3 |
# head() returns the first 5 rows by default
data.head(2)
| AC | BC | CC | DC | |
|---|---|---|---|---|
| AI | 3 | 7 | 8 | 7 |
| BI | 2 | 2 | 8 | 3 |
# tail() returns the last 5 rows by default
data.tail(2)
| AC | BC | CC | DC | |
|---|---|---|---|---|
| CI | 2 | 8 | 2 | 5 |
| DI | 1 | 7 | 1 | 3 |
data
| AC | BC | CC | DC | |
|---|---|---|---|---|
| AI | 3 | 7 | 8 | 7 |
| BI | 2 | 2 | 8 | 3 |
| CI | 2 | 8 | 2 | 5 |
| DI | 1 | 7 | 1 | 3 |
# access an element by integer row/column position
data.iloc[1,1]
2
data.iloc[2:,2]
CI    2
DI    1
Name: CC, dtype: int32
# access an element by row/column label
data.loc['BI','BC']
2
data.loc['BI':,'BC':]
| BC | CC | DC | |
|---|---|---|---|
| BI | 2 | 8 | 3 |
| CI | 8 | 2 | 5 |
| DI | 7 | 1 | 3 |
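Beyond label slices, loc also accepts boolean masks, which is the usual way to filter rows by a condition. A sketch with fixed values (so the result is reproducible, unlike the random frames above):

```python
import pandas as pd

# A small frame mirroring the AC..DC layout above, with fixed values.
df = pd.DataFrame({'AC': [3, 2, 2, 1], 'BC': [7, 2, 8, 7]},
                  index=['AI', 'BI', 'CI', 'DI'])

# Rows where BC > 5, keeping only the AC column.
picked = df.loc[df['BC'] > 5, 'AC']
```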
Building on data access, values can be reassigned in place
index=['AI','BI','CI','DI']
columns=['AC','BC','CC','DC']
data=pd.DataFrame(
np.random.randint(1,8,16).reshape(4,4),
index=index,
columns=columns
)
data.loc['AI','AC']=99
data
| AC | BC | CC | DC | |
|---|---|---|---|---|
| AI | 99 | 7 | 1 | 7 |
| BI | 6 | 5 | 2 | 6 |
| CI | 5 | 5 | 5 | 3 |
| DI | 4 | 4 | 4 | 7 |
data.loc['BI','AC':]=[1,2,3,4]
data
| AC | BC | CC | DC | |
|---|---|---|---|---|
| AI | 99 | 7 | 1 | 7 |
| BI | 1 | 2 | 3 | 4 |
| CI | 5 | 5 | 5 | 3 |
| DI | 4 | 4 | 4 | 7 |
data.loc['CI','AC':]=np.nan
data
| AC | BC | CC | DC | |
|---|---|---|---|---|
| AI | 99.0 | 7.0 | 1.0 | 7.0 |
| BI | 1.0 | 2.0 | 3.0 | 4.0 |
| CI | NaN | NaN | NaN | NaN |
| DI | 4.0 | 4.0 | 4.0 | 7.0 |
data
| AC | BC | CC | DC | |
|---|---|---|---|---|
| AI | 99.0 | 7.0 | 1.0 | 7.0 |
| BI | 1.0 | 2.0 | 3.0 | 4.0 |
| CI | NaN | NaN | NaN | NaN |
| DI | 4.0 | 4.0 | 4.0 | 7.0 |
data['EC']=13
data
| AC | BC | CC | DC | EC | |
|---|---|---|---|---|---|
| AI | 99.0 | 7.0 | 1.0 | 7.0 | 13 |
| BI | 1.0 | 2.0 | 3.0 | 4.0 | 13 |
| CI | NaN | NaN | NaN | NaN | 13 |
| DI | 4.0 | 4.0 | 4.0 | 7.0 | 13 |
data
| AC | BC | CC | DC | EC | |
|---|---|---|---|---|---|
| AI | 99.0 | 7.0 | 1.0 | 7.0 | 13 |
| BI | 1.0 | 2.0 | 3.0 | 4.0 | 13 |
| CI | NaN | NaN | NaN | NaN | 13 |
| DI | 4.0 | 4.0 | 4.0 | 7.0 | 13 |
# axis selects whether to drop a row (axis=0) or a column (axis=1)
data.drop('EC',axis=1)
| AC | BC | CC | DC | |
|---|---|---|---|---|
| AI | 99.0 | 7.0 | 1.0 | 7.0 |
| BI | 1.0 | 2.0 | 3.0 | 4.0 |
| CI | NaN | NaN | NaN | NaN |
| DI | 4.0 | 4.0 | 4.0 | 7.0 |
data.drop('CI',axis=0)
| AC | BC | CC | DC | EC | |
|---|---|---|---|---|---|
| AI | 99.0 | 7.0 | 1.0 | 7.0 | 13 |
| BI | 1.0 | 2.0 | 3.0 | 4.0 | 13 |
| DI | 4.0 | 4.0 | 4.0 | 7.0 | 13 |
data=np.random.randint(1,9,16)
index=['AI','BI','CI','DI']
columns=['AC','BC','CC','DC']
data=pd.DataFrame(data.reshape(4,4),index=index,
columns=columns)
data
| AC | BC | CC | DC | |
|---|---|---|---|---|
| AI | 3 | 1 | 3 | 8 |
| BI | 1 | 4 | 6 | 6 |
| CI | 2 | 5 | 5 | 5 |
| DI | 6 | 2 | 8 | 8 |
data.min()
AC    1
BC    1
CC    3
DC    5
dtype: int32
data.mean()
AC    3.00
BC    3.00
CC    5.50
DC    6.75
dtype: float64
data.max()
AC    6
BC    5
CC    8
DC    8
dtype: int32
data.median()
AC    2.5
BC    3.0
CC    5.5
DC    7.0
dtype: float64
data.std()
AC    2.160247
BC    1.825742
CC    2.081666
DC    1.500000
dtype: float64
data.count()
AC    4
BC    4
CC    4
DC    4
dtype: int64
data.mode()
| AC | BC | CC | DC | |
|---|---|---|---|---|
| 0 | 1 | 1 | 3 | 8.0 |
| 1 | 2 | 2 | 5 | NaN |
| 2 | 3 | 4 | 6 | NaN |
| 3 | 6 | 5 | 8 | NaN |
data.describe()
| AC | BC | CC | DC | |
|---|---|---|---|---|
| count | 4.000000 | 4.000000 | 4.000000 | 4.00 |
| mean | 3.000000 | 3.000000 | 5.500000 | 6.75 |
| std | 2.160247 | 1.825742 | 2.081666 | 1.50 |
| min | 1.000000 | 1.000000 | 3.000000 | 5.00 |
| 25% | 1.750000 | 1.750000 | 4.500000 | 5.75 |
| 50% | 2.500000 | 3.000000 | 5.500000 | 7.00 |
| 75% | 3.750000 | 4.250000 | 6.500000 | 8.00 |
| max | 6.000000 | 5.000000 | 8.000000 | 8.00 |
data['CC'].value_counts()
3    1
6    1
5    1
8    1
Name: CC, dtype: int64
Timestamp is the most fundamental and most commonly used of the time types; in most cases, time-related strings are converted to Timestamp. pandas provides the to_datetime function for exactly this purpose.
The DatetimeIndex and PeriodIndex functions
➢ Besides converting data to Timestamp directly inside the original DataFrame, a column can be extracted on its own and converted to a DatetimeIndex or a PeriodIndex.
➢ When converting to a PeriodIndex, note that the freq parameter must specify the period length; common values are Y for year, M for month, D for day, H for hour, T for minute, and S for second. Both functions can convert existing data as well as create new time-series data, and their parameters are very similar.
data=pd.read_csv('meal_order_info.csv',encoding='gbk')
data['lock_time'].head()
0    2016/8/1 11:11:46
1    2016/8/1 11:31:55
2    2016/8/1 12:54:37
3    2016/8/1 13:08:20
4    2016/8/1 13:07:16
Name: lock_time, dtype: object
pd.to_datetime(data['lock_time']).head()
0   2016-08-01 11:11:46
1   2016-08-01 11:31:55
2   2016-08-01 12:54:37
3   2016-08-01 13:08:20
4   2016-08-01 13:07:16
Name: lock_time, dtype: datetime64[ns]
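to_datetime also copes with unparseable entries: errors='coerce' turns them into NaT instead of raising. A sketch with a made-up bad value:

```python
import pandas as pd

# One valid timestamp (same format as lock_time) and one invalid string.
raw = pd.Series(['2016/8/1 11:11:46', 'not a date'])

# errors='coerce' converts unparseable strings to NaT.
parsed = pd.to_datetime(raw, errors='coerce')
```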
data_index=data
data_index['lock_time'].head()
0    2016/8/1 11:11:46
1    2016/8/1 11:31:55
2    2016/8/1 12:54:37
3    2016/8/1 13:08:20
4    2016/8/1 13:07:16
Name: lock_time, dtype: object
pd.DatetimeIndex(data_index['lock_time'])
DatetimeIndex(['2016-08-01 11:11:46', '2016-08-01 11:31:55',
'2016-08-01 12:54:37', '2016-08-01 13:08:20',
'2016-08-01 13:07:16', '2016-08-01 13:23:42',
'2016-08-01 13:34:18', '2016-08-01 13:50:16',
'2016-08-01 17:18:20', '2016-08-01 17:44:27',
...
'2016-08-31 18:18:31', '2016-08-31 18:42:42',
'2016-08-31 18:57:56', '2016-08-31 19:19:31',
'2016-08-31 20:39:13', '2016-08-31 21:31:48',
'2016-08-31 21:56:12', '2016-08-31 21:33:34',
'2016-08-31 21:55:39', '2016-08-31 21:32:56'],
dtype='datetime64[ns]', name='lock_time', length=945, freq=None)
# freq options (from the PeriodIndex docstring):
# freq : str or period object, optional
# One of pandas period strings or corresponding objects.
# year : int, array, or Series, default None
# month : int, array, or Series, default None
# quarter : int, array, or Series, default None
# day : int, array, or Series, default None
# hour : int, array, or Series, default None
# minute : int, array, or Series, default None
# second : int, array, or Series, default None
# dtype : str or PeriodDtype, default None
# freq='h' truncates each timestamp to hour precision
data_period=data
pd.PeriodIndex(data_period['lock_time'],freq='h')
PeriodIndex(['2016-08-01 11:00', '2016-08-01 11:00', '2016-08-01 12:00',
'2016-08-01 13:00', '2016-08-01 13:00', '2016-08-01 13:00',
'2016-08-01 13:00', '2016-08-01 13:00', '2016-08-01 17:00',
'2016-08-01 17:00',
...
'2016-08-31 18:00', '2016-08-31 18:00', '2016-08-31 18:00',
'2016-08-31 19:00', '2016-08-31 20:00', '2016-08-31 21:00',
'2016-08-31 21:00', '2016-08-31 21:00', '2016-08-31 21:00',
'2016-08-31 21:00'],
dtype='period[H]', name='lock_time', length=945)
# pd.PeriodIndex?
data=pd.read_csv('meal_order_info.csv',encoding='gbk')
data=pd.to_datetime(data['lock_time'])
data
0 2016-08-01 11:11:46
1 2016-08-01 11:31:55
2 2016-08-01 12:54:37
3 2016-08-01 13:08:20
4 2016-08-01 13:07:16
...
940 2016-08-31 21:31:48
941 2016-08-31 21:56:12
942 2016-08-31 21:33:34
943 2016-08-31 21:55:39
944 2016-08-31 21:32:56
Name: lock_time, Length: 945, dtype: datetime64[ns]
data[0].year,data[0].month,data[0].day
(2016, 8, 1)
data[0].hour,data[0].minute,data[0].second
(11, 11, 46)
# extract the year of every row via the dt accessor
data.dt.year.head()
0    2016.0
1    2016.0
2    2016.0
3    2016.0
4    2016.0
Name: lock_time, dtype: float64
# extract the month of every row via the dt accessor
data.dt.month.head()
0    8.0
1    8.0
2    8.0
3    8.0
4    8.0
Name: lock_time, dtype: float64
The Timedelta class
➢ With Timedelta, it is easy to add or subtract a span of time from a timestamp.
➢ Two time series can be subtracted directly, yielding a Timedelta.
data=pd.read_csv('meal_order_info.csv',encoding='gbk')
data=pd.to_datetime(data['lock_time'])
data
0 2016-08-01 11:11:46
1 2016-08-01 11:31:55
2 2016-08-01 12:54:37
3 2016-08-01 13:08:20
4 2016-08-01 13:07:16
...
940 2016-08-31 21:31:48
941 2016-08-31 21:56:12
942 2016-08-31 21:33:34
943 2016-08-31 21:55:39
944 2016-08-31 21:32:56
Name: lock_time, Length: 945, dtype: datetime64[ns]
# time arithmetic; Timedelta accepts these keyword arguments:
# weeks, days, hours, minutes,
# seconds, milliseconds, microseconds,
# nanoseconds
data[1]+pd.Timedelta(weeks=3)
Timestamp('2016-08-22 11:31:55')
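As the second bullet above notes, subtracting two datetime series yields timedeltas directly — a sketch using two pairs of timestamps shaped like the sample data:

```python
import pandas as pd

start = pd.to_datetime(pd.Series(['2016-08-01 11:05:36',
                                  '2016-08-01 11:15:57']))
end = pd.to_datetime(pd.Series(['2016-08-01 11:11:46',
                                '2016-08-01 11:31:55']))

# Subtracting two datetime Series yields a timedelta64 Series,
# e.g. the dining duration of each order.
duration = end - start
```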
Notes on the by parameter
➢ If a function is passed, it is applied to the index values and the results define the groups.
➢ If a dict or Series is passed, the dict's or Series's values are used as the grouping keys.
➢ If a NumPy array is passed, its elements are used as the grouping keys.
➢ If a string or list of strings is passed, the columns named by those strings are used as the grouping keys.
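The first two cases can be sketched on a toy frame (labels and values made up):

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4]}, index=['a1', 'a2', 'b1', 'b2'])

# A function passed as `by` is applied to each index label;
# here the first character of the label defines the group.
by_func = df.groupby(lambda label: label[0]).sum()

# A dict maps index labels to group keys.
by_dict = df.groupby({'a1': 'g1', 'a2': 'g1',
                      'b1': 'g2', 'b2': 'g2'}).sum()
```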
data=pd.read_excel('meal_order_detail.xlsx')
data.head()
| detail_id | order_id | dishes_id | logicprn_name | parent_class_name | dishes_name | itemis_add | counts | amounts | cost | place_order_time | discount_amt | discount_reason | kick_back | add_inprice | add_info | bar_code | picture_file | emp_id | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2956 | 417 | 610062 | NaN | NaN | 蒜蓉生蚝 | 0 | 1 | 49 | NaN | 2016-08-01 11:05:36 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/104001.jpg | 1442 |
| 1 | 2958 | 417 | 609957 | NaN | NaN | 蒙古烤羊腿_x000D_\n_x000D_\n_x000D_\n | 0 | 1 | 48 | NaN | 2016-08-01 11:07:07 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/202003.jpg | 1442 |
| 2 | 2961 | 417 | 609950 | NaN | NaN | 大蒜苋菜 | 0 | 1 | 30 | NaN | 2016-08-01 11:07:40 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/303001.jpg | 1442 |
| 3 | 2966 | 417 | 610038 | NaN | NaN | 芝麻烤紫菜 | 0 | 1 | 25 | NaN | 2016-08-01 11:11:11 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/105002.jpg | 1442 |
| 4 | 2968 | 417 | 610003 | NaN | NaN | 蒜香包 | 0 | 1 | 13 | NaN | 2016-08-01 11:11:30 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/503002.jpg | 1442 |
group_data=data[['order_id','counts','amounts']].groupby(by='order_id')
group_data
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000028C5ECE0370>
# pass the aggregation functions as a list
group_data.agg(['min','max']).head()
| counts | amounts | |||
|---|---|---|---|---|
| min | max | min | max | |
| order_id | ||||
| 137 | 1 | 4 | 1 | 99 |
| 165 | 1 | 2 | 9 | 178 |
| 166 | 1 | 2 | 6 | 109 |
| 171 | 1 | 4 | 10 | 65 |
| 177 | 1 | 1 | 16 | 48 |
# pass per-column aggregation functions as a dict
group_data.agg({'counts':['min','max'],'amounts':'mean'}).head()
| counts | amounts | ||
|---|---|---|---|
| min | max | mean | |
| order_id | |||
| 137 | 1 | 4 | 32.333333 |
| 165 | 1 | 2 | 52.944444 |
| 166 | 1 | 2 | 48.200000 |
| 171 | 1 | 4 | 36.285714 |
| 177 | 1 | 1 | 34.250000 |
# apply takes a single function, applied to each group's whole frame
group_data.apply('mean').head()
| counts | amounts | |
|---|---|---|
| order_id | ||
| 137 | 1.500000 | 32.333333 |
| 165 | 1.166667 | 52.944444 |
| 166 | 1.400000 | 48.200000 |
| 171 | 1.428571 | 36.285714 |
| 177 | 1.000000 | 34.250000 |
# group_data.apply?
Creating pivot tables with the pivot_table function
➢ If no aggregation function is given via aggfunc, numpy.mean is used by default; numpy.mean automatically skips non-numeric columns. The aggregation can be changed through the aggfunc parameter.
➢ As with groupby, the grouping key index of pivot_table may contain multiple columns.
➢ The columns parameter specifies grouping by column.
➢ When the data has many columns and only some should be displayed, the values parameter selects them.
➢ Combinations with no data are filled with NaN automatically; the fill_value parameter substitutes a chosen value for these missing entries.
➢ The margins parameter adds summary (total) rows and columns.
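A minimal sketch of the index, values, aggfunc, and margins parameters on a made-up mini order table:

```python
import pandas as pd

# Hypothetical mini order table, shaped like the meal data above.
orders = pd.DataFrame({
    'order_id': [137, 137, 165, 165],
    'dishes_name': ['a', 'b', 'a', 'a'],
    'counts': [1, 2, 1, 1],
})

# margins=True appends an 'All' row holding the grand total.
pt = pd.pivot_table(orders, index='order_id', values='counts',
                    aggfunc='sum', margins=True)
```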
data=pd.read_excel('meal_order_detail.xlsx')
data.head()
| detail_id | order_id | dishes_id | logicprn_name | parent_class_name | dishes_name | itemis_add | counts | amounts | cost | place_order_time | discount_amt | discount_reason | kick_back | add_inprice | add_info | bar_code | picture_file | emp_id | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2956 | 417 | 610062 | NaN | NaN | 蒜蓉生蚝 | 0 | 1 | 49 | NaN | 2016-08-01 11:05:36 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/104001.jpg | 1442 |
| 1 | 2958 | 417 | 609957 | NaN | NaN | 蒙古烤羊腿_x000D_\n_x000D_\n_x000D_\n | 0 | 1 | 48 | NaN | 2016-08-01 11:07:07 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/202003.jpg | 1442 |
| 2 | 2961 | 417 | 609950 | NaN | NaN | 大蒜苋菜 | 0 | 1 | 30 | NaN | 2016-08-01 11:07:40 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/303001.jpg | 1442 |
| 3 | 2966 | 417 | 610038 | NaN | NaN | 芝麻烤紫菜 | 0 | 1 | 25 | NaN | 2016-08-01 11:11:11 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/105002.jpg | 1442 |
| 4 | 2968 | 417 | 610003 | NaN | NaN | 蒜香包 | 0 | 1 | 13 | NaN | 2016-08-01 11:11:30 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/503002.jpg | 1442 |
# data[['order_id','counts','dishes_name']] selects the data
# index='order_id' sets the row grouping key
# columns='dishes_name' sets the column grouping key
# aggfunc='count' sets the aggregation function
# fill_value=0 fills missing combinations with 0
pd.pivot_table(data[['order_id','counts','dishes_name']],
index='order_id',columns='dishes_name',
aggfunc='count',fill_value=0).head()
| counts | |||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| dishes_name | 42度海之蓝 | 北冰洋汽水 | 38度剑南春 | 50度古井贡酒 | 52度泸州老窖 | 53度茅台 | 一品香酥藕 | 三丝鳝鱼 | 三色凉拌手撕兔 | 不加一滴油的酸奶蛋糕 | ... | 香辣腐乳炒虾 | 香酥两吃大虾 | 鱼香肉丝拌面 | 鲜美鳝鱼 | 鸡蛋、肉末肠粉 | 麻辣小龙虾 | 黄尾袋鼠西拉子红葡萄酒 | 黄油曲奇饼干 | 黄花菜炒木耳 | 黑米恋上葡萄 |
| order_id | |||||||||||||||||||||
| 137 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 165 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 166 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 171 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 177 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 154 columns
➢ A crosstab is a special kind of pivot table, mainly used to compute group frequencies.
➢ Since a crosstab is a form of pivot table, the parameters are largely the same; the difference is that index, columns, and values of the crosstab function each take a column extracted from a DataFrame.
➢ pandas.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, dropna=True, normalize=False)
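Without values/aggfunc, crosstab simply counts co-occurrences, and its normalize parameter converts those counts into proportions — a sketch on made-up columns:

```python
import pandas as pd

orders = pd.Series([137, 137, 165])
dishes = pd.Series(['a', 'b', 'a'])

# Plain crosstab counts co-occurrences of the two columns.
counts = pd.crosstab(index=orders, columns=dishes)

# normalize=True divides every cell by the grand total.
props = pd.crosstab(index=orders, columns=dishes, normalize=True)
```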
data.head()
| detail_id | order_id | dishes_id | logicprn_name | parent_class_name | dishes_name | itemis_add | counts | amounts | cost | place_order_time | discount_amt | discount_reason | kick_back | add_inprice | add_info | bar_code | picture_file | emp_id | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2956 | 417 | 610062 | NaN | NaN | 蒜蓉生蚝 | 0 | 1 | 49 | NaN | 2016-08-01 11:05:36 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/104001.jpg | 1442 |
| 1 | 2958 | 417 | 609957 | NaN | NaN | 蒙古烤羊腿_x000D_\n_x000D_\n_x000D_\n | 0 | 1 | 48 | NaN | 2016-08-01 11:07:07 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/202003.jpg | 1442 |
| 2 | 2961 | 417 | 609950 | NaN | NaN | 大蒜苋菜 | 0 | 1 | 30 | NaN | 2016-08-01 11:07:40 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/303001.jpg | 1442 |
| 3 | 2966 | 417 | 610038 | NaN | NaN | 芝麻烤紫菜 | 0 | 1 | 25 | NaN | 2016-08-01 11:11:11 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/105002.jpg | 1442 |
| 4 | 2968 | 417 | 610003 | NaN | NaN | 蒜香包 | 0 | 1 | 13 | NaN | 2016-08-01 11:11:30 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/503002.jpg | 1442 |
# index=data['order_id'] sets the rows
# columns=data['dishes_name'] sets the columns
# values=data['counts'] sets the values
# aggfunc='count' sets the aggregation function
# fillna(0) fills the missing values with 0
pd.crosstab(index=data['order_id'],columns=data['dishes_name'],
values=data['counts'],aggfunc='count').fillna(0).head()
| dishes_name | 42度海之蓝 | 北冰洋汽水 | 38度剑南春 | 50度古井贡酒 | 52度泸州老窖 | 53度茅台 | 一品香酥藕 | 三丝鳝鱼 | 三色凉拌手撕兔 | 不加一滴油的酸奶蛋糕 | ... | 香辣腐乳炒虾 | 香酥两吃大虾 | 鱼香肉丝拌面 | 鲜美鳝鱼 | 鸡蛋、肉末肠粉 | 麻辣小龙虾 | 黄尾袋鼠西拉子红葡萄酒 | 黄油曲奇饼干 | 黄花菜炒木耳 | 黑米恋上葡萄 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| order_id | |||||||||||||||||||||
| 137 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 165 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 166 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 171 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 177 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 154 columns
import pandas as pd
import numpy as np
# np.nan represents a missing value
df1 = dict({'name':['Jackosn-1','Jason-1','Black-1',
np.nan,'Lucy-1',np.nan],
'score':[90, 60, 50, 100, np.nan, 99]})
df2 = dict({'name':['Jackosn-2','Jason-2','Black-2',
np.nan,'Lucy-2',np.nan],
'score':[90, 60, 50, 100, np.nan, 99]})
data1 = pd.DataFrame(df1)
data2 = pd.DataFrame(df2)
data1,data2
( name score
0 Jackosn-1 90.0
1 Jason-1 60.0
2 Black-1 50.0
3 NaN 100.0
4 Lucy-1 NaN
5 NaN 99.0,
name score
0 Jackosn-2 90.0
1 Jason-2 60.0
2 Black-2 50.0
3 NaN 100.0
4 Lucy-2 NaN
5 NaN 99.0)
# axis=1 stacks horizontally; axis=0 stacks vertically
pd.concat([data1,data2],axis=1)
| name | score | name | score | |
|---|---|---|---|---|
| 0 | Jackosn-1 | 90.0 | Jackosn-2 | 90.0 |
| 1 | Jason-1 | 60.0 | Jason-2 | 60.0 |
| 2 | Black-1 | 50.0 | Black-2 | 50.0 |
| 3 | NaN | 100.0 | NaN | 100.0 |
| 4 | Lucy-1 | NaN | Lucy-2 | NaN |
| 5 | NaN | 99.0 | NaN | 99.0 |
data1['add']=11
data1
| name | score | add | |
|---|---|---|---|
| 0 | Jackosn-1 | 90.0 | 11 |
| 1 | Jason-1 | 60.0 | 11 |
| 2 | Black-1 | 50.0 | 11 |
| 3 | NaN | 100.0 | 11 |
| 4 | Lucy-1 | NaN | 11 |
| 5 | NaN | 99.0 | 11 |
# join='inner' keeps only the columns common to both frames (intersection)
pd.concat([data1,data2],join='inner')
| name | score | |
|---|---|---|
| 0 | Jackosn-1 | 90.0 |
| 1 | Jason-1 | 60.0 |
| 2 | Black-1 | 50.0 |
| 3 | NaN | 100.0 |
| 4 | Lucy-1 | NaN |
| 5 | NaN | 99.0 |
| 0 | Jackosn-2 | 90.0 |
| 1 | Jason-2 | 60.0 |
| 2 | Black-2 | 50.0 |
| 3 | NaN | 100.0 |
| 4 | Lucy-2 | NaN |
| 5 | NaN | 99.0 |
# join='outer' keeps the union of the columns
pd.concat([data1,data2],join='outer')
| name | score | add | |
|---|---|---|---|
| 0 | Jackosn-1 | 90.0 | 11.0 |
| 1 | Jason-1 | 60.0 | 11.0 |
| 2 | Black-1 | 50.0 | 11.0 |
| 3 | NaN | 100.0 | 11.0 |
| 4 | Lucy-1 | NaN | 11.0 |
| 5 | NaN | 99.0 | 11.0 |
| 0 | Jackosn-2 | 90.0 | NaN |
| 1 | Jason-2 | 60.0 | NaN |
| 2 | Black-2 | 50.0 | NaN |
| 3 | NaN | 100.0 | NaN |
| 4 | Lucy-2 | NaN | NaN |
| 5 | NaN | 99.0 | NaN |
data1,data2
( name score add
0 Jackosn-1 90.0 11
1 Jason-1 60.0 11
2 Black-1 50.0 11
3 NaN 100.0 11
4 Lucy-1 NaN 11
5 NaN 99.0 11,
name score
0 Jackosn-2 90.0
1 Jason-2 60.0
2 Black-2 50.0
3 NaN 100.0
4 Lucy-2 NaN
5 NaN 99.0)
data=data1.drop('add',axis=1)
data
| name | score | |
|---|---|---|
| 0 | Jackosn-1 | 90.0 |
| 1 | Jason-1 | 60.0 |
| 2 | Black-1 | 50.0 |
| 3 | NaN | 100.0 |
| 4 | Lucy-1 | NaN |
| 5 | NaN | 99.0 |
data.append(data2)
C:\Users\86195\AppData\Local\Temp\ipykernel_3676\8019119.py:1: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  data.append(data2)
| name | score | |
|---|---|---|
| 0 | Jackosn-1 | 90.0 |
| 1 | Jason-1 | 60.0 |
| 2 | Black-1 | 50.0 |
| 3 | NaN | 100.0 |
| 4 | Lucy-1 | NaN |
| 5 | NaN | 99.0 |
| 0 | Jackosn-2 | 90.0 |
| 1 | Jason-2 | 60.0 |
| 2 | Black-2 | 50.0 |
| 3 | NaN | 100.0 |
| 4 | Lucy-2 | NaN |
| 5 | NaN | 99.0 |
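Since append is deprecated, pd.concat with ignore_index=True is its direct replacement (toy frames for illustration):

```python
import pandas as pd

d1 = pd.DataFrame({'name': ['Jason-1'], 'score': [60.0]})
d2 = pd.DataFrame({'name': ['Jason-2'], 'score': [90.0]})

# pd.concat replaces the deprecated DataFrame.append;
# ignore_index=True renumbers the result 0..n-1 instead of
# keeping the duplicated source indices.
combined = pd.concat([d1, d2], ignore_index=True)
```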
data1=pd.read_csv('order_sample.csv')
data2=pd.read_csv('dishes_info.csv')
data1.head(),data2.head()
( detail_id order_id dishes_id place_order_time
0 2956 417 610062 2016-08-01 11:05:36
1 2958 417 609957 2016-08-01 11:07:07
2 2961 417 609950 2016-08-01 11:07:40
3 2966 417 610038 2016-08-01 11:11:11
4 2968 417 610003 2016-08-01 11:11:30,
dishes_id dishes_name amounts
0 610062 蒜蓉生蚝 49
1 609957 蒙古烤羊腿\r\n\r\n\r\n 48
2 609950 大蒜苋菜 30
3 610038 芝麻烤紫菜 25
4 610003 蒜香包 13)
# left is the left table, right the right table
# how='left' would keep every left-table row, filling unmatched
# right-table values with NaN
# on is used when the key has the same name in both tables;
# with different names, specify each table's key via left_on / right_on
pd.merge(left=data1,right=data2,on='dishes_id').head()
| detail_id | order_id | dishes_id | place_order_time | dishes_name | amounts | |
|---|---|---|---|---|---|---|
| 0 | 2956 | 417 | 610062 | 2016-08-01 11:05:36 | 蒜蓉生蚝 | 49 |
| 1 | 3574 | 467 | 610062 | 2016-08-01 19:49:48 | 蒜蓉生蚝 | 49 |
| 2 | 893 | 171 | 610062 | 2016-08-02 13:55:27 | 蒜蓉生蚝 | 49 |
| 3 | 5691 | 673 | 610062 | 2016-08-04 20:38:09 | 蒜蓉生蚝 | 49 |
| 4 | 4626 | 578 | 610062 | 2016-08-05 21:35:29 | 蒜蓉生蚝 | 49 |
# on='dishes_id' specifies the join key (matched against the right table's index)
# lsuffix='_add_l' suffix for overlapping columns from the left table
# rsuffix='_add_r' suffix for overlapping columns from the right table
data1.join(data2,on='dishes_id',
lsuffix='_add_l',
rsuffix='_add_r').head()
| detail_id | order_id | dishes_id_add_l | place_order_time | dishes_id_add_r | dishes_name | amounts | |
|---|---|---|---|---|---|---|---|
| 0 | 2956 | 417 | 610062 | 2016-08-01 11:05:36 | NaN | NaN | NaN |
| 1 | 2958 | 417 | 609957 | 2016-08-01 11:07:07 | NaN | NaN | NaN |
| 2 | 2961 | 417 | 609950 | 2016-08-01 11:07:40 | NaN | NaN | NaN |
| 3 | 2966 | 417 | 610038 | 2016-08-01 11:11:11 | NaN | NaN | NaN |
| 4 | 2968 | 417 | 610003 | 2016-08-01 11:11:30 | NaN | NaN | NaN |
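The all-NaN result above is expected: join matches data1's on column against data2's index, which here is just the default 0..n-1, so nothing lines up. Re-indexing the right table by its key first gives the intended match — a sketch on tiny made-up frames mirroring the columns:

```python
import pandas as pd

left = pd.DataFrame({'order_id': [417], 'dishes_id': [610062]})
right = pd.DataFrame({'dishes_id': [610062],
                      'dishes_name': ['蒜蓉生蚝']})

# join matches left['dishes_id'] against the RIGHT table's index;
# setting dishes_id as that index makes the keys line up.
joined = left.join(right.set_index('dishes_id'), on='dishes_id')
```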
# np.nan represents a missing value
df1 = dict({'name':['Jackosn-1','Jason-1','Black-1',
np.nan,'Lucy-1',np.nan],
'score':[90, 60, 50, 100, np.nan, 99]})
df2 = dict({'name':['Jackosn-2','Jason-2','Black-2',
np.nan,'Lucy-2',np.nan],
'score':[90, 60, 50, 100, np.nan, 99]})
data1 = pd.DataFrame(df1)
data2 = pd.DataFrame(df2)
data1,data2
( name score
0 Jackosn-1 90.0
1 Jason-1 60.0
2 Black-1 50.0
3 NaN 100.0
4 Lucy-1 NaN
5 NaN 99.0,
name score
0 Jackosn-2 90.0
1 Jason-2 60.0
2 Black-2 50.0
3 NaN 100.0
4 Lucy-2 NaN
5 NaN 99.0)
data1.combine_first(data2)
| name | score | |
|---|---|---|
| 0 | Jackosn-1 | 90.0 |
| 1 | Jason-1 | 60.0 |
| 2 | Black-1 | 50.0 |
| 3 | NaN | 100.0 |
| 4 | Lucy-1 | NaN |
| 5 | NaN | 99.0 |
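Nothing changes in the output above because the NaN positions coincide in both frames. When they differ, combine_first patches holes in the caller with values from the argument:

```python
import pandas as pd
import numpy as np

a = pd.Series([1.0, np.nan, 3.0])
b = pd.Series([9.0, 2.0, 9.0])

# Missing entries in `a` are filled from `b`; existing values in `a` win.
patched = a.combine_first(b)
```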
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data=pd.read_csv('detail_duplicates.csv')
data.head()
| order_id | dishes_name | counts | amounts | |
|---|---|---|---|---|
| 0 | NaN | 蒜蓉生蚝 | 1.0 | 49.0 |
| 1 | NaN | NaN | 1.0 | 48.0 |
| 2 | 417.0 | 大蒜苋菜 | 1.0 | 30.0 |
| 3 | 417.0 | 芝麻烤紫菜 | 1.0 | 25.0 |
| 4 | 417.0 | 蒜香包 | 1.0 | 13.0 |
# subset sets the columns used for deduplication; multiple columns may be passed
# keep decides whether the first or last of a set of duplicates is retained
# inplace controls whether the source data is modified; default False
data.drop_duplicates(subset=['order_id'],
keep='last',
inplace=False)
| order_id | dishes_name | counts | amounts | |
|---|---|---|---|---|
| 5 | NaN | NaN | NaN | NaN |
| 9 | 301.0 | 番茄有机花菜 | 1.0 | 32.0 |
| 10 | 417.0 | 蒙古烤羊腿\r\n\r\n\r\n | 1.0 | 48.0 |
# subset with multiple columns
data.drop_duplicates(subset=['order_id','dishes_name'])
| order_id | dishes_name | counts | amounts | |
|---|---|---|---|---|
| 0 | NaN | 蒜蓉生蚝 | 1.0 | 49.0 |
| 1 | NaN | NaN | 1.0 | 48.0 |
| 2 | 417.0 | 大蒜苋菜 | 1.0 | 30.0 |
| 3 | 417.0 | 芝麻烤紫菜 | 1.0 | 25.0 |
| 4 | 417.0 | 蒜香包 | 1.0 | 13.0 |
| 6 | 301.0 | 香烤牛排\r\n | 1.0 | 55.0 |
| 8 | 301.0 | 芝麻烤紫菜 | 1.0 | 25.0 |
| 9 | 301.0 | 番茄有机花菜 | 1.0 | 32.0 |
| 10 | 417.0 | 蒙古烤羊腿\r\n\r\n\r\n | 1.0 | 48.0 |
# which values are missing
data.isnull().head()
| order_id | dishes_name | counts | amounts | |
|---|---|---|---|---|
| 0 | True | False | False | False |
| 1 | True | True | False | False |
| 2 | False | False | False | False |
| 3 | False | False | False | False |
| 4 | False | False | False | False |
# which values are present (not missing)
data.notnull().head()
| order_id | dishes_name | counts | amounts | |
|---|---|---|---|---|
| 0 | False | True | True | True |
| 1 | False | False | True | True |
| 2 | True | True | True | True |
| 3 | True | True | True | True |
| 4 | True | True | True | True |
data
| order_id | dishes_name | counts | amounts | |
|---|---|---|---|---|
| 0 | NaN | 蒜蓉生蚝 | 1.0 | 49.0 |
| 1 | NaN | NaN | 1.0 | 48.0 |
| 2 | 417.0 | 大蒜苋菜 | 1.0 | 30.0 |
| 3 | 417.0 | 芝麻烤紫菜 | 1.0 | 25.0 |
| 4 | 417.0 | 蒜香包 | 1.0 | 13.0 |
| 5 | NaN | NaN | NaN | NaN |
| 6 | 301.0 | 香烤牛排\r\n | 1.0 | 55.0 |
| 7 | 417.0 | 芝麻烤紫菜 | 1.0 | 25.0 |
| 8 | 301.0 | 芝麻烤紫菜 | 1.0 | 25.0 |
| 9 | 301.0 | 番茄有机花菜 | 1.0 | 32.0 |
| 10 | 417.0 | 蒙古烤羊腿\r\n\r\n\r\n | 1.0 | 48.0 |
# axis: axis to drop along
# how: 'any' or 'all'
# subset: columns used to decide; multiple columns may be passed
data.dropna(axis=0,how='all',
subset=['dishes_name'])
| order_id | dishes_name | counts | amounts | |
|---|---|---|---|---|
| 0 | NaN | 蒜蓉生蚝 | 1.0 | 49.0 |
| 2 | 417.0 | 大蒜苋菜 | 1.0 | 30.0 |
| 3 | 417.0 | 芝麻烤紫菜 | 1.0 | 25.0 |
| 4 | 417.0 | 蒜香包 | 1.0 | 13.0 |
| 6 | 301.0 | 香烤牛排\r\n | 1.0 | 55.0 |
| 7 | 417.0 | 芝麻烤紫菜 | 1.0 | 25.0 |
| 8 | 301.0 | 芝麻烤紫菜 | 1.0 | 25.0 |
| 9 | 301.0 | 番茄有机花菜 | 1.0 | 32.0 |
| 10 | 417.0 | 蒙古烤羊腿\r\n\r\n\r\n | 1.0 | 48.0 |
data.head()
| order_id | dishes_name | counts | amounts | |
|---|---|---|---|---|
| 0 | NaN | 蒜蓉生蚝 | 1.0 | 49.0 |
| 1 | NaN | NaN | 1.0 | 48.0 |
| 2 | 417.0 | 大蒜苋菜 | 1.0 | 30.0 |
| 3 | 417.0 | 芝麻烤紫菜 | 1.0 | 25.0 |
| 4 | 417.0 | 蒜香包 | 1.0 | 13.0 |
# fill missing values with a fixed value
data.fillna(13)
| order_id | dishes_name | counts | amounts | |
|---|---|---|---|---|
| 0 | 13.0 | 蒜蓉生蚝 | 1.0 | 49.0 |
| 1 | 13.0 | 13 | 1.0 | 48.0 |
| 2 | 417.0 | 大蒜苋菜 | 1.0 | 30.0 |
| 3 | 417.0 | 芝麻烤紫菜 | 1.0 | 25.0 |
| 4 | 417.0 | 蒜香包 | 1.0 | 13.0 |
| 5 | 13.0 | 13 | 13.0 | 13.0 |
| 6 | 301.0 | 香烤牛排\r\n | 1.0 | 55.0 |
| 7 | 417.0 | 芝麻烤紫菜 | 1.0 | 25.0 |
| 8 | 301.0 | 芝麻烤紫菜 | 1.0 | 25.0 |
| 9 | 301.0 | 番茄有机花菜 | 1.0 | 32.0 |
| 10 | 417.0 | 蒙古烤羊腿\r\n\r\n\r\n | 1.0 | 48.0 |
# fill missing values from the table itself
# forward fill: propagate the previous value downward
data.fillna(method='ffill')
| order_id | dishes_name | counts | amounts | |
|---|---|---|---|---|
| 0 | NaN | 蒜蓉生蚝 | 1.0 | 49.0 |
| 1 | NaN | 蒜蓉生蚝 | 1.0 | 48.0 |
| 2 | 417.0 | 大蒜苋菜 | 1.0 | 30.0 |
| 3 | 417.0 | 芝麻烤紫菜 | 1.0 | 25.0 |
| 4 | 417.0 | 蒜香包 | 1.0 | 13.0 |
| 5 | 417.0 | 蒜香包 | 1.0 | 13.0 |
| 6 | 301.0 | 香烤牛排\r\n | 1.0 | 55.0 |
| 7 | 417.0 | 芝麻烤紫菜 | 1.0 | 25.0 |
| 8 | 301.0 | 芝麻烤紫菜 | 1.0 | 25.0 |
| 9 | 301.0 | 番茄有机花菜 | 1.0 | 32.0 |
| 10 | 417.0 | 蒙古烤羊腿\r\n\r\n\r\n | 1.0 | 48.0 |
# fill missing values from the table itself
# backward fill: propagate the next value upward
data.fillna(method='bfill')
| order_id | dishes_name | counts | amounts | |
|---|---|---|---|---|
| 0 | 417.0 | 蒜蓉生蚝 | 1.0 | 49.0 |
| 1 | 417.0 | 大蒜苋菜 | 1.0 | 48.0 |
| 2 | 417.0 | 大蒜苋菜 | 1.0 | 30.0 |
| 3 | 417.0 | 芝麻烤紫菜 | 1.0 | 25.0 |
| 4 | 417.0 | 蒜香包 | 1.0 | 13.0 |
| 5 | 301.0 | 香烤牛排\r\n | 1.0 | 55.0 |
| 6 | 301.0 | 香烤牛排\r\n | 1.0 | 55.0 |
| 7 | 417.0 | 芝麻烤紫菜 | 1.0 | 25.0 |
| 8 | 301.0 | 芝麻烤紫菜 | 1.0 | 25.0 |
| 9 | 301.0 | 番茄有机花菜 | 1.0 | 32.0 |
| 10 | 417.0 | 蒙古烤羊腿\r\n\r\n\r\n | 1.0 | 48.0 |
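Newer pandas versions deprecate the method= keyword of fillna in favor of the dedicated ffill/bfill methods, which behave identically:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0])

# ffill/bfill are the direct equivalents of
# fillna(method='ffill') / fillna(method='bfill').
forward = s.ffill()
backward = s.bfill()
```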
data = pd.read_excel('meal_order_detail.xlsx')
data
| detail_id | order_id | dishes_id | logicprn_name | parent_class_name | dishes_name | itemis_add | counts | amounts | cost | place_order_time | discount_amt | discount_reason | kick_back | add_inprice | add_info | bar_code | picture_file | emp_id | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2956 | 417 | 610062 | NaN | NaN | 蒜蓉生蚝 | 0 | 1 | 49 | NaN | 2016-08-01 11:05:36 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/104001.jpg | 1442 |
| 1 | 2958 | 417 | 609957 | NaN | NaN | 蒙古烤羊腿_x000D_\n_x000D_\n_x000D_\n | 0 | 1 | 48 | NaN | 2016-08-01 11:07:07 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/202003.jpg | 1442 |
| 2 | 2961 | 417 | 609950 | NaN | NaN | 大蒜苋菜 | 0 | 1 | 30 | NaN | 2016-08-01 11:07:40 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/303001.jpg | 1442 |
| 3 | 2966 | 417 | 610038 | NaN | NaN | 芝麻烤紫菜 | 0 | 1 | 25 | NaN | 2016-08-01 11:11:11 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/105002.jpg | 1442 |
| 4 | 2968 | 417 | 610003 | NaN | NaN | 蒜香包 | 0 | 1 | 13 | NaN | 2016-08-01 11:11:30 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/503002.jpg | 1442 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2774 | 6750 | 774 | 610011 | NaN | NaN | 白饭/大碗 | 0 | 1 | 10 | NaN | 2016-08-10 21:56:24 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/601005.jpg | 1138 |
| 2775 | 6742 | 774 | 609996 | NaN | NaN | 牛尾汤 | 0 | 1 | 40 | NaN | 2016-08-10 21:56:48 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/201006.jpg | 1138 |
| 2776 | 6756 | 774 | 609949 | NaN | NaN | 意文柠檬汁 | 0 | 1 | 13 | NaN | 2016-08-10 22:01:52 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/404005.jpg | 1138 |
| 2777 | 6763 | 774 | 610014 | NaN | NaN | 金玉良缘 | 0 | 1 | 30 | NaN | 2016-08-10 22:03:58 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/302003.jpg | 1138 |
| 2778 | 6764 | 774 | 610017 | NaN | NaN | 酸辣藕丁 | 0 | 1 | 33 | NaN | 2016-08-10 22:04:30 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/302006.jpg | 1138 |
2779 rows × 19 columns
data['amounts'].isnull().sum()
0
plt.boxplot(data['amounts'])
plt.show()
# plt.boxplot?
# custom outlier handler using the 1.5*IQR rule
# (note: the assignment writes through x, so the original
# column data['amounts'] is modified in place)
def deal_with(x):
QU = x.quantile(0.75)
QL = x.quantile(0.25)
IQR = QU - QL
x[(x > (QU + 1.5*IQR)) |
(x < (QL - 1.5*IQR))] = np.nan
return x
data['amounts'].isnull().sum()
0
print(deal_with(data['amounts']).isnull().sum())
173
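Because deal_with assigns through the Series it receives, it silently modifies data['amounts'] in place. A variant that works on a copy (same 1.5*IQR rule; the name mask_outliers is hypothetical) avoids that side effect:

```python
import pandas as pd
import numpy as np

def mask_outliers(x):
    # Same 1.5*IQR rule as deal_with above, but applied to a
    # copy so the caller's column is left untouched.
    x = x.copy()
    ql, qu = x.quantile(0.25), x.quantile(0.75)
    iqr = qu - ql
    x[(x > qu + 1.5 * iqr) | (x < ql - 1.5 * iqr)] = np.nan
    return x

# Toy data: 100.0 is far outside the IQR fence.
s = pd.Series([1.0, 2.0, 3.0, 2.0, 100.0])
cleaned = mask_outliers(s)
```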
data = pd.read_excel('meal_order_detail.xlsx')
data.head()
| detail_id | order_id | dishes_id | logicprn_name | parent_class_name | dishes_name | itemis_add | counts | amounts | cost | place_order_time | discount_amt | discount_reason | kick_back | add_inprice | add_info | bar_code | picture_file | emp_id | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2956 | 417 | 610062 | NaN | NaN | 蒜蓉生蚝 | 0 | 1 | 49 | NaN | 2016-08-01 11:05:36 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/104001.jpg | 1442 |
| 1 | 2958 | 417 | 609957 | NaN | NaN | 蒙古烤羊腿 | 0 | 1 | 48 | NaN | 2016-08-01 11:07:07 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/202003.jpg | 1442 |
| 2 | 2961 | 417 | 609950 | NaN | NaN | 大蒜苋菜 | 0 | 1 | 30 | NaN | 2016-08-01 11:07:40 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/303001.jpg | 1442 |
| 3 | 2966 | 417 | 610038 | NaN | NaN | 芝麻烤紫菜 | 0 | 1 | 25 | NaN | 2016-08-01 11:11:11 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/105002.jpg | 1442 |
| 4 | 2968 | 417 | 610003 | NaN | NaN | 蒜香包 | 0 | 1 | 13 | NaN | 2016-08-01 11:11:30 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/503002.jpg | 1442 |
# Custom function implementing min-max (deviation) normalization
def deviation_standar(x):
    return (x - x.min()) / (x.max() - x.min())
deviation_standar(data[['counts','amounts']])
| counts | amounts | |
|---|---|---|
| 0 | 0.0 | 0.271186 |
| 1 | 0.0 | 0.265537 |
| 2 | 0.0 | 0.163842 |
| 3 | 0.0 | 0.135593 |
| 4 | 0.0 | 0.067797 |
| ... | ... | ... |
| 2774 | 0.0 | 0.050847 |
| 2775 | 0.0 | 0.220339 |
| 2776 | 0.0 | 0.067797 |
| 2777 | 0.0 | 0.163842 |
| 2778 | 0.0 | 0.180791 |
2779 rows × 2 columns
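Min-max normalization always maps a column onto [0, 1]: the minimum becomes 0 and the maximum becomes 1. A quick check on a toy Series (values chosen arbitrarily):

```python
import pandas as pd

def deviation_standar(x):
    # min-max normalization: (x - min) / (max - min), result lies in [0, 1]
    return (x - x.min()) / (x.max() - x.min())

s = pd.Series([13, 25, 30, 48, 178])
scaled = deviation_standar(s)
print(scaled.min(), scaled.max())  # 0.0 1.0
```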
data.head()
| detail_id | order_id | dishes_id | logicprn_name | parent_class_name | dishes_name | itemis_add | counts | amounts | cost | place_order_time | discount_amt | discount_reason | kick_back | add_inprice | add_info | bar_code | picture_file | emp_id | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2956 | 417 | 610062 | NaN | NaN | 蒜蓉生蚝 | 0 | 1 | 49 | NaN | 2016-08-01 11:05:36 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/104001.jpg | 1442 |
| 1 | 2958 | 417 | 609957 | NaN | NaN | 蒙古烤羊腿 | 0 | 1 | 48 | NaN | 2016-08-01 11:07:07 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/202003.jpg | 1442 |
| 2 | 2961 | 417 | 609950 | NaN | NaN | 大蒜苋菜 | 0 | 1 | 30 | NaN | 2016-08-01 11:07:40 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/303001.jpg | 1442 |
| 3 | 2966 | 417 | 610038 | NaN | NaN | 芝麻烤紫菜 | 0 | 1 | 25 | NaN | 2016-08-01 11:11:11 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/105002.jpg | 1442 |
| 4 | 2968 | 417 | 610003 | NaN | NaN | 蒜香包 | 0 | 1 | 13 | NaN | 2016-08-01 11:11:30 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/503002.jpg | 1442 |
# Custom function implementing z-score (standard deviation) normalization
def standarzation_standar(x):
    return (x - x.mean()) / x.std()
standarzation_standar(data[['counts','amounts']])
| counts | amounts | |
|---|---|---|
| 0 | -0.177784 | 0.099510 |
| 1 | -0.177784 | 0.072343 |
| 2 | -0.177784 | -0.416674 |
| 3 | -0.177784 | -0.552512 |
| 4 | -0.177784 | -0.878523 |
| ... | ... | ... |
| 2774 | -0.177784 | -0.960026 |
| 2775 | -0.177784 | -0.144998 |
| 2776 | -0.177784 | -0.878523 |
| 2777 | -0.177784 | -0.416674 |
| 2778 | -0.177784 | -0.335171 |
2779 rows × 2 columns
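After z-score normalization the transformed column has mean 0 and (sample) standard deviation 1, which can be confirmed on toy data (values chosen arbitrarily):

```python
import pandas as pd

def standarzation_standar(x):
    # z-score normalization: resulting mean ~ 0, sample std ~ 1
    return (x - x.mean()) / x.std()

s = pd.Series([13, 25, 30, 48, 49])
z = standarzation_standar(s)
print(abs(z.mean()) < 1e-9, abs(z.std() - 1) < 1e-9)  # True True
```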
data.head()
| detail_id | order_id | dishes_id | logicprn_name | parent_class_name | dishes_name | itemis_add | counts | amounts | cost | place_order_time | discount_amt | discount_reason | kick_back | add_inprice | add_info | bar_code | picture_file | emp_id | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2956 | 417 | 610062 | NaN | NaN | 蒜蓉生蚝 | 0 | 1 | 49 | NaN | 2016-08-01 11:05:36 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/104001.jpg | 1442 |
| 1 | 2958 | 417 | 609957 | NaN | NaN | 蒙古烤羊腿 | 0 | 1 | 48 | NaN | 2016-08-01 11:07:07 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/202003.jpg | 1442 |
| 2 | 2961 | 417 | 609950 | NaN | NaN | 大蒜苋菜 | 0 | 1 | 30 | NaN | 2016-08-01 11:07:40 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/303001.jpg | 1442 |
| 3 | 2966 | 417 | 610038 | NaN | NaN | 芝麻烤紫菜 | 0 | 1 | 25 | NaN | 2016-08-01 11:11:11 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/105002.jpg | 1442 |
| 4 | 2968 | 417 | 610003 | NaN | NaN | 蒜香包 | 0 | 1 | 13 | NaN | 2016-08-01 11:11:30 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/503002.jpg | 1442 |
# Custom function implementing decimal-scaling normalization
import numpy as np

def decimal_standar(x):
    # k is the smallest integer such that max(|x|) / 10**k falls within (-1, 1]
    k = np.ceil(np.log10(x.abs().max()))
    return x / 10 ** k
decimal_standar(data[['counts','amounts']])
| counts | amounts | |
|---|---|---|
| 0 | 0.1 | 0.049 |
| 1 | 0.1 | 0.048 |
| 2 | 0.1 | 0.030 |
| 3 | 0.1 | 0.025 |
| 4 | 0.1 | 0.013 |
| ... | ... | ... |
| 2774 | 0.1 | 0.010 |
| 2775 | 0.1 | 0.040 |
| 2776 | 0.1 | 0.013 |
| 2777 | 0.1 | 0.030 |
| 2778 | 0.1 | 0.033 |
2779 rows × 2 columns
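In the standard formulation of decimal scaling, each value is divided by `10**k`, where `k = ceil(log10(max|x|))`, so the results fall within (-1, 1]. A toy sketch (values chosen arbitrarily):

```python
import numpy as np
import pandas as pd

def decimal_standar(x):
    # divide by 10**k, where k = ceil(log10(max |x|))
    k = np.ceil(np.log10(x.abs().max()))
    return x / 10 ** k

s = pd.Series([49, 178, 13])  # max is 178, so k = 3 and we divide by 1000
result = decimal_standar(s)
print(result.tolist())  # [0.049, 0.178, 0.013]
```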
data.head()
| detail_id | order_id | dishes_id | logicprn_name | parent_class_name | dishes_name | itemis_add | counts | amounts | cost | place_order_time | discount_amt | discount_reason | kick_back | add_inprice | add_info | bar_code | picture_file | emp_id | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2956 | 417 | 610062 | NaN | NaN | 蒜蓉生蚝 | 0 | 1 | 49 | NaN | 2016-08-01 11:05:36 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/104001.jpg | 1442 |
| 1 | 2958 | 417 | 609957 | NaN | NaN | 蒙古烤羊腿 | 0 | 1 | 48 | NaN | 2016-08-01 11:07:07 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/202003.jpg | 1442 |
| 2 | 2961 | 417 | 609950 | NaN | NaN | 大蒜苋菜 | 0 | 1 | 30 | NaN | 2016-08-01 11:07:40 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/303001.jpg | 1442 |
| 3 | 2966 | 417 | 610038 | NaN | NaN | 芝麻烤紫菜 | 0 | 1 | 25 | NaN | 2016-08-01 11:11:11 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/105002.jpg | 1442 |
| 4 | 2968 | 417 | 610003 | NaN | NaN | 蒜香包 | 0 | 1 | 13 | NaN | 2016-08-01 11:11:30 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/503002.jpg | 1442 |
# Dummy-variable (one-hot) encoding of the categorical column
pd.get_dummies(data[['amounts','dishes_name']])
| amounts | dishes_name_ 42度海之蓝 | dishes_name_ 北冰洋汽水 | dishes_name_38度剑南春 | dishes_name_50度古井贡酒 | dishes_name_52度泸州老窖 | dishes_name_53度茅台 | dishes_name_一品香酥藕 | dishes_name_三丝鳝鱼 | dishes_name_三色凉拌手撕兔 | ... | dishes_name_香辣腐乳炒虾 | dishes_name_香酥两吃大虾 | dishes_name_鱼香肉丝拌面 | dishes_name_鲜美鳝鱼 | dishes_name_鸡蛋、肉末肠粉 | dishes_name_麻辣小龙虾 | dishes_name_黄尾袋鼠西拉子红葡萄酒 | dishes_name_黄油曲奇饼干 | dishes_name_黄花菜炒木耳 | dishes_name_黑米恋上葡萄 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 49 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 48 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 30 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 25 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 13 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2774 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2775 | 40 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2776 | 13 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2777 | 30 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2778 | 33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2779 rows × 155 columns
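`pd.get_dummies` generates one indicator column per distinct value of the categorical column, which is why the result above grows to 155 columns. On a small Series with two distinct dish names (toy data):

```python
import pandas as pd

# Two distinct values -> two indicator columns, one row per observation
s = pd.Series(['蒜蓉生蚝', '大蒜苋菜', '蒜蓉生蚝'], name='dishes_name')
dummies = pd.get_dummies(s)
print(dummies.shape)  # (3, 2)
```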
data.head()
| detail_id | order_id | dishes_id | logicprn_name | parent_class_name | dishes_name | itemis_add | counts | amounts | cost | place_order_time | discount_amt | discount_reason | kick_back | add_inprice | add_info | bar_code | picture_file | emp_id | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2956 | 417 | 610062 | NaN | NaN | 蒜蓉生蚝 | 0 | 1 | 49 | NaN | 2016-08-01 11:05:36 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/104001.jpg | 1442 |
| 1 | 2958 | 417 | 609957 | NaN | NaN | 蒙古烤羊腿 | 0 | 1 | 48 | NaN | 2016-08-01 11:07:07 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/202003.jpg | 1442 |
| 2 | 2961 | 417 | 609950 | NaN | NaN | 大蒜苋菜 | 0 | 1 | 30 | NaN | 2016-08-01 11:07:40 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/303001.jpg | 1442 |
| 3 | 2966 | 417 | 610038 | NaN | NaN | 芝麻烤紫菜 | 0 | 1 | 25 | NaN | 2016-08-01 11:11:11 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/105002.jpg | 1442 |
| 4 | 2968 | 417 | 610003 | NaN | NaN | 蒜香包 | 0 | 1 | 13 | NaN | 2016-08-01 11:11:30 | NaN | NaN | NaN | 0 | NaN | NaN | caipu/503002.jpg | 1442 |
# data['amounts'] is the continuous column to be discretized
# 5 is the number of equal-width intervals
pd.cut(data['amounts'],5)
0 (36.4, 71.8]
1 (36.4, 71.8]
2 (0.823, 36.4]
3 (0.823, 36.4]
4 (0.823, 36.4]
...
2774 (0.823, 36.4]
2775 (36.4, 71.8]
2776 (0.823, 36.4]
2777 (0.823, 36.4]
2778 (0.823, 36.4]
Name: amounts, Length: 2779, dtype: category
Categories (5, interval[float64, right]): [(0.823, 36.4] < (36.4, 71.8] < (71.8, 107.2] < (107.2, 142.6] < (142.6, 178.0]]
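`pd.cut` splits the value range into equal-width intervals; passing `labels=` replaces the interval labels with custom category names. A toy sketch (arbitrary values, three bins):

```python
import pandas as pd

s = pd.Series([13, 25, 30, 48, 178])
# Equal-width discretization into 3 bins with named categories
bins = pd.cut(s, 3, labels=['low', 'mid', 'high'])
print(bins.tolist())  # ['low', 'low', 'low', 'low', 'high']
```

For equal-frequency bins (roughly the same number of observations per interval) `pd.qcut` is the counterpart, cutting at quantiles instead of at fixed widths.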